Entropy, Mutual Information, KL divergence

Information theory

Information theory is a branch of applied mathematics that revolves around quantifying how much information is present in a signal.

Basics

Consider 2 messages:

  • A: it is dark tonight

  • B: it is a blood moon tonight

Message A is plainly uninformative: almost every night is a dark night. Message B describes an unlikely event and so carries more information content; a blood moon occurs only every few years.

Therefore we want to quantify information such that more unlikely events have higher information content. The more surprised we are by an event occurring, the more information content it carries.

The self-information of an event is:

I(x) = -log_{2}(p(x)) bits

If we throw a fair coin, how surprised will we be when it lands heads?

I(H) = -log_{2}(0.5) = 1 bit

What if our coin is unfair? Consider

p(H) = 0.99 and p(T) = 0.01, the probabilities of landing heads and tails respectively.

Then:

I(H) = -log_{2}(0.99) = 0.014 bits

and

I(T) = -log_{2}(0.01) = 6.64 bits

We see that more unlikely events contain more information.
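
As a minimal sketch (in Python, assuming base-2 logarithms throughout; the helper name self_information is mine), the numbers above can be reproduced as follows:

```python
import math

def self_information(p: float) -> float:
    """Self-information I(x) = -log2(p(x)), in bits."""
    return -math.log2(p)

print(self_information(0.5))   # fair coin, heads: 1.0 bit
print(self_information(0.99))  # unfair coin, heads: ~0.014 bits
print(self_information(0.01))  # unfair coin, tails: ~6.64 bits
```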

Entropy

If we keep throwing the coin, we generate more information. How much information is produced on average? In other words, what is the expected value of the information content? That's entropy. Entropy quantifies the amount of uncertainty/surprise/information involved in a random variable or process.

Hence:

H(X) = E[I(X)] = E[-log_2(p(X))] = -\sum_{x \in X} p(x)log_2(p(x))

In our fair coin this would be:

H(X_{fair}) = - (0.5log_2(0.5) + 0.5log_2(0.5)) = 1 bit

whereas

H(X_{unfair}) = - (0.99log_2(0.99) + 0.01log_2(0.01)) = 0.08 bits
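
A quick sketch checking both entropies (the entropy helper is mine; zero-probability terms are skipped since 0·log(0) is taken to be 0):

```python
import math

def entropy(probs) -> float:
    """H(X) = -sum_x p(x) log2 p(x), in bits; zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy([0.99, 0.01]))  # unfair coin: ~0.08 bits
```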

Consider a fair n-sided die:

H(X_{fair_{n}}) = -n \cdot \frac{1}{n}log_2(\frac{1}{n}) = -log_2(\frac{1}{n}) = log_2(n)

n     information
2     1 bit
4     2 bits
8     3 bits
...   ...

Therefore, the more spread out the outcomes of our random variable or process, the greater the expected uncertainty and thus the expected information content. Intuitively, you are surprised less often when you know the coin is almost always heads. Information gained = reduced expected uncertainty = reduced expected information content = fewer bits required to encode the outcomes.

How does this translate to signals/messages in information theory? For example, a 2-sided die (a coin) has entropy 1 bit, which means the amount of information needed to specify the value of one toss is 1 bit. Therefore it may be encoded in a message as \{side_1: 0, side_2: 1\}, which is precisely 1 bit per toss. So a message of 3 coin tosses may look like this: 011. For a 4-sided die we encode each roll as \{side_1 : 00, side_2 : 01, side_3 : 10, side_4: 11 \}, so 3 die rolls may look like this: 001011.
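
A toy sketch of this encoding idea, using the code tables above (the outcome names and the encode helper are just illustrative; fixed-length codes are used, which is optimal here because all outcomes are equally likely):

```python
# Fixed-length codes: 1 bit per coin toss, 2 bits per roll of a fair 4-sided die.
coin_code = {"side_1": "0", "side_2": "1"}
die4_code = {"side_1": "00", "side_2": "01", "side_3": "10", "side_4": "11"}

def encode(outcomes, code):
    """Concatenate the codeword of each outcome into one message."""
    return "".join(code[o] for o in outcomes)

print(encode(["side_1", "side_2", "side_2"], coin_code))  # '011'
print(encode(["side_1", "side_3", "side_4"], die4_code))  # '001011'
```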

Notes:

  • I use H(X) = H(p(x)) = H(p) interchangeably for a random variable X with outcomes x_i and probability distribution x \sim p(x)

Joint and Conditional Entropy

Consider two random variables X and Y with joint distribution p(x, y).

The joint entropy may be defined as: H(X, Y) = - \sum_{x \in X} \sum_{y \in Y} p(x, y) log_{2}(p(x, y))

Joint entropy: joint expected uncertainty/joint expected information content.

Similarly, the conditional entropy H(Y|X) is defined as:

H(Y|X) = \sum_{x \in X} p(x)H(Y | X = x) = - \sum_{x \in X} \sum_{y \in Y} p(x, y) log_{2}(p(y|x))

Conditional entropy H(Y|X): how much uncertainty/information content remains in Y after X has been observed.
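
A small sketch with a made-up joint distribution (the numbers are purely illustrative, not from the text) that computes H(X, Y) and H(Y|X) and checks the chain rule mentioned in the notes below:

```python
import math

# Illustrative joint distribution p(x, y) over X in {0, 1} and Y in {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def entropy(probs):
    """H = -sum p log2 p, in bits (zero-probability terms contribute 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x)."""
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
    return -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

H_xy = entropy(p_xy.values())            # joint entropy H(X, Y) ~ 1.846 bits
H_y_given_x = conditional_entropy(p_xy)  # H(Y|X) ~ 0.846 bits
H_x = entropy([0.4 + 0.1, 0.2 + 0.3])    # marginal entropy H(X) = 1 bit

print(round(H_xy, 3), round(H_y_given_x + H_x, 3))  # chain rule: both ~1.846
```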

Notes:

  • We may apply the chain rule as follows: H(X, Y) = H(X | Y) + H(Y) = H(Y | X) + H(X)

  • If X and Y are independent then H(Y|X) = H(Y), that is, knowing X does not change the entropy of Y

  • In general, H(Y|X) \neq H(X|Y)

Mutual Information

Mutual information is the information gain/uncertainty reduction in X after we know Y, or vice versa.

Note the difference between conditional entropy and mutual information: conditional entropy measures how much uncertainty remains in X given Y, whereas mutual information measures how much uncertainty in X is reduced by knowing Y. Looking at the Venn diagram below, we can clearly see that this is I(X;Y).

[Figure: Venn diagram of the entropies H(X) and H(Y); the overlap is the mutual information I(X;Y)]

Therefore:

I(X;Y) = \sum_{x,y} p(x,y) log_2(\frac{p(x,y)}{p(x)p(y)})

As:

\begin{aligned} I(X;Y) &= H(X) - H(X|Y) \\ &= H(Y) - H(Y|X) \\ &= -\sum_{y} p(y)log_2(p(y)) - [- \sum_{x,y} p(x, y) log_{2}(p(y|x))] \\ &= -\sum_{x,y} p(x,y)log_2(p(y)) + \sum_{x,y} p(x, y) log_{2}(\frac{p(x,y)}{p(x)}) \\ &= \sum_{x,y} p(x,y) log_2(\frac{p(x,y)}{p(x)p(y)}) \end{aligned}
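
Continuing the illustrative joint distribution from the earlier sketch (made-up numbers), a quick check that the direct definition agrees with the entropy-based identity I(X;Y) = H(X) + H(Y) - H(X,Y), which follows from the chain rule:

```python
import math

# Same illustrative joint distribution as before, plus its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Direct definition: I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x)p(y)) )
I_direct = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy.items())

# Via entropies: I(X;Y) = H(X) + H(Y) - H(X,Y)
I_entropies = entropy(p_x.values()) + entropy(p_y.values()) - entropy(p_xy.values())

print(round(I_direct, 4), round(I_entropies, 4))  # both ~0.1245 bits
```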

Notes:

  • I(X; X) = H(X) - knowing X gives you all the information about X
  • Mutual information is symmetric

KL divergence and cross entropy

Recall entropy:

H(p(x)) = H(p) = - \sum_x p(x)log_2(p(x))

Now we want to measure the 'relatedness' between two distributions p(x) and q(x). Let p(x) be the true underlying distribution and q(x) our approximation. The cross entropy is defined as:

H(p, q) = -\sum_x p(x) log_2(q(x))

The cross entropy between p(x) and q(x) is the average number of bits (using log_2) needed to encode an event x drawn from p(x) if we use a code based on the approximate distribution q(x) instead of the true distribution p(x).

For example if we are approximating a fair coin toss:

\begin{aligned} & p(x = 0) = 0.5 \\ & p(x = 1) = 0.5 \\ \\ & q(x = 0) = 0.4 \\ & q(x = 1) = 0.6 \\ \end{aligned}
\begin{aligned} H(p, q) & \\ &= -[0.5*log_2(0.4) + 0.5*log_2(0.6)] \\ &= 0.661 + 0.368 \\ &= 1.029 bits \\ H(p, p) & \\ &= H(p) \\ &= -[0.5*log_2(0.5) + 0.5*log_2(0.5)] \\ &= 0.5 + 0.5 \\ &= 1 bit \end{aligned}

Therefore, for H(p, q), we require 1.029 bits to encode an event x \sim p(x) using distribution q. In comparison, for H(p, p), as expected, we require 1 bit to encode an event x \sim p(x) using the true distribution p. Therefore 0.029 extra bits are required to encode an event in x using distribution q compared to distribution p.
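
A sketch reproducing these numbers (the cross_entropy helper is mine; log base 2, results rounded):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q))

p = [0.5, 0.5]  # true distribution (fair coin)
q = [0.4, 0.6]  # approximate distribution

print(round(cross_entropy(p, q), 3))  # 1.029 bits
print(round(cross_entropy(p, p), 3))  # 1.0 bit, i.e. H(p)
```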

Therefore:

H(p, q) - H(p, p) = 0.029 bits

\begin{aligned} H(p, q) - H(p, p) & \\ &= -\sum_x p(x) log(q(x)) - (-\sum_x p(x) log(p(x))) \\ &= \sum_x p(x) log(p(x)) - \sum_x p(x) log(q(x)) \\ &= \sum_x p(x) log \frac{p(x)}{q(x)} \end{aligned}

This is the Kullback-Leibler (KL) divergence. By definition the KL-divergence FROM p TO q is:

KL(p || q) = \sum_{x} p(x) log(\frac{p(x)}{q(x)})

Therefore cross entropy between distributions p and q may be written as:

H(p, q) = H(p) + KL(p || q) = -\sum_{x} p(x) log_2(q(x))
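
A sketch of the KL divergence and of this decomposition, reusing the coin example above (helper names are mine; zero-probability terms in p are skipped):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

p = [0.5, 0.5]  # true distribution
q = [0.4, 0.6]  # approximation

print(round(kl_divergence(p, q), 3))               # 0.029 extra bits
print(round(entropy(p) + kl_divergence(p, q), 3))  # 1.029 bits = H(p) + KL(p||q) = H(p, q)
```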

In summary:

  • H(p) - Average number of bits to represent an event drawn from p.
  • KL(p || q) - Average number of extra bits to represent an event drawn from p when using q instead of p.
  • H(p, q) - Average total number of bits to represent an event drawn from p when using q instead of p.

Notes:

  • Cross entropy is used as a loss function in classification problems.

  • KL divergence is asymmetric. KL(q || p) \neq KL(p || q).

  • Mutual information may be written as the KL-divergence FROM joint distribution TO product of marginals:

    I(X;Y) = KL(p(x,y) || p(x)p(y))

    So, in plain terms: the number of extra bits needed to represent an event from the true distribution p (the joint distribution of X and Y) using the distribution q (the product of the marginals of X and Y) is equivalent to how much information (in bits) is obtained about one random variable (e.g. X) having observed the other (e.g. Y). A quick numeric check is sketched after these notes.

    If X and Y are independent then the KL divergence is 0, as no extra bits are needed. Similarly, the mutual information is 0, as no information about one variable is obtained by observing the other.
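
As a quick numeric check of these last two notes (the marginals are the illustrative ones from the earlier sketches; with the dependent joint used there, the same sum gives I(X;Y) = KL ≈ 0.1245 bits):

```python
import math

p_x = {0: 0.5, 1: 0.5}
p_y = {0: 0.6, 1: 0.4}

# Under independence p(x,y) = p(x)p(y), so every log ratio inside the sum is log2(1) = 0.
p_xy_indep = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}

mi = sum(p * math.log2(p / (p_x[x] * p_y[y])) for (x, y), p in p_xy_indep.items())
print(mi)  # 0.0 -- I(X;Y) = KL(p(x,y) || p(x)p(y)) vanishes for independent X and Y
```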